
IEEE Transactions on Computational Biology and Bioinformatics

Institute of Electrical and Electronics Engineers (IEEE)

Preprints posted in the last 90 days, ranked by how well they match IEEE Transactions on Computational Biology and Bioinformatics's content profile, based on 17 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits

Kusumoto, T.

2026-01-30 bioinformatics 10.64898/2026.01.28.695053 medRxiv
Top 0.1%
9.2%

In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.

2
From Circles to Signals: Representation Learning on Ultra-Long Extrachromosomal Circular DNA

Li, J.; Liu, Z.; Zhang, Z.; Zhang, J.; Singh, R.

2026-03-17 bioinformatics 10.1101/2025.11.22.689941 medRxiv
Top 0.1%
6.2%

Extrachromosomal circular DNA (eccDNA) is a covalently closed circular DNA molecule that plays an important role in cancer biology. Genomic foundation models have recently emerged as a powerful direction for DNA sequence modeling, enabling the direct prediction of biologically relevant properties from DNA sequences. Although recent genomic foundation models have shown strong performance on general DNA sequence modeling, their application to eccDNA remains limited: existing approaches either rely on computationally expensive attention mechanisms or truncate ultra-long sequences into kilobase fragments, thereby disrupting long-range continuity and ignoring the molecule's circular topology. To overcome these problems, we introduce eccDNAMamba, a bidirectional state space model (SSM) built upon the Mamba-2 framework, which scales linearly with input sequence length and enables scalable modeling of ultra-long eccDNA sequences. eccDNAMamba further incorporates a circular augmentation strategy to preserve the intrinsic circular topology of eccDNA. Comprehensive evaluations against state-of-the-art genomic foundation models demonstrate that eccDNAMamba achieves superior performance on ultra-long sequences across multiple task settings, such as cancer versus healthy eccDNA discrimination and eccDNA copy-number level prediction. Moreover, Integrated Gradients (IG)-based model explanation indicates that eccDNAMamba focuses on biologically meaningful regulatory elements and can uncover key sequence patterns in cancer-derived eccDNAs. Overall, these results demonstrate that eccDNAMamba effectively models ultra-long eccDNA sequences by leveraging their unique circular topology and regulatory architecture, bridging a critical gap in sequence analysis. Our code and datasets are available at https://github.com/zzq1zh/eccDNAMamba.
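The abstract above does not say how the circular augmentation is implemented. As a rough illustration only, the sketch below shows two generic ways a linear sequence model can be exposed to circular topology: wrap-around padding, so the junction between the last and first base appears as contiguous context, and random-offset rotation, since any rotation of a circular molecule is the same molecule. The function names `circular_pad` and `rotate` and the `overlap` parameter are hypothetical and are not taken from eccDNAMamba.

```python
def circular_pad(seq: str, overlap: int) -> str:
    """Append the first `overlap` bases to the end of the sequence.

    Assumption: this is a generic wrap-around padding, not the
    augmentation used by eccDNAMamba. It makes the end-to-start
    junction of the circle visible to a linear model.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    overlap = min(overlap, len(seq))
    return seq + seq[:overlap]


def rotate(seq: str, offset: int) -> str:
    """Return the same circular molecule linearized at a different start."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


padded = circular_pad("ACGTTG", 3)   # junction "TG|AC" is now contiguous
rotated = rotate("ACGTTG", 2)        # same circle, different breakpoint
```

Rotation changes only the arbitrary breakpoint chosen when the circle was linearized, so a label attached to the molecule stays valid for every rotated copy.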

3
Joint Learning of Drug-Drug Combination and Drug-Drug Interaction via Coupled Tensor-Tensor Factorization with Side Information

Zhang, X.; Fang, Z.; Tang, K.; Chen, H.; Li, J.

2026-02-16 bioinformatics 10.64898/2026.02.13.705792 medRxiv
Top 0.1%
4.0%

Targeted drug therapies offer a promising approach for treating complex diseases, with combinational drug therapies often employed to enhance therapeutic efficacy. However, unintended drug-drug interactions (DDIs) may undermine treatment outcomes or cause adverse side effects. In this work, we propose a novel joint learning framework for the simultaneous prediction of effective drug combinations and drug-drug interactions, based on coupled tensor-tensor factorization. Specifically, we model drug combination therapies and DDIs by representing drug-drug-disease associations and drug-drug interaction profiles as coupled three-way tensors. To address the challenges of data incompleteness and sparsity, the proposed model integrates auxiliary drug similarity information, such as chemical structure similarities, drug-specific side effects, drug target profiles, and drug inhibition data on cancer cell lines, within a multi-view learning framework. For optimization, we adopt a modified Alternating Direction Method of Multipliers (ADMM) algorithm that ensures convergence while enforcing non-negativity constraints. In addition to standard tensor completion tasks, we further evaluate the proposed method under a more realistic new-drug prediction setting, where all interactions involving a previously unseen drug are withheld. This scenario closely aligns with real-world applications, in which reliable predictions for emerging or under-studied compounds are essential. We evaluate the proposed method on a comprehensive dataset compiled from multiple sources, including DrugBank, CDCDB, SIDER, and PubChem. Our experiments show that SI-ADMM maintains robust performance and achieves the best results compared to other tensor factorization approaches, with or without auxiliary information, particularly in the new-drug prediction setting. The implementation of our method is publicly available at: https://github.com/Xiaoge-Zhang/SI-ADMM.

4
Mutation-centric Network Construction using Long-Range Interactions

Huseynov, R.; Otlu, B.

2026-03-18 bioinformatics 10.64898/2026.03.16.712158 medRxiv
Top 0.1%
3.9%

Somatic mutations can alter normal cells and lead to cancer development. Yet distinguishing functional driver mutations from neutral passenger mutations remains a significant challenge. Traditional genomic tools often prioritize linear overlap searches, failing to capture the complex, three-dimensional regulatory environment of the genome. We present a graph-based framework, MutationNetwork, for constructing mutation-centric networks by integrating long-range intrachromosomal interactions with local genomic overlaps. Our method utilizes a unique positive and negative indexing scheme to represent interacting genomic intervals as nodes. By encoding both interactions and overlaps as edges, we enable constant-time retrieval of complex relationship data. By iteratively expanding the graph from a seed mutation, we can quantify a mutation's influence on the genomic landscape and assess its proximity to genes. We applied this framework to a dataset of 560 breast cancer whole-genome sequences, focusing on Triple-Negative Breast Cancer (TNBC) and Luminal A subtypes. Our results demonstrate that the generated mutation embeddings successfully cluster samples according to their biological subtypes, with the highest classification performance achieved at specific ranges. This approach provides a comprehensive view of mutation impact, offering a scalable solution for cancer patient stratification and the prioritization of potential non-coding driver mutations by assessing their network-level impact. Availability and implementation: The source code is available at https://github.com/Ramalh/MutationNetwork

5
DisGeneFormer: Precise Disease Gene Prioritization by Integrating Local and Global Graph Attention

Koeksal, R.; Fritz, A.; Kumar, A.; Schmidts, M.; Tran, V. D.; Backofen, R.

2026-03-14 bioinformatics 10.64898/2026.03.11.711106 medRxiv
Top 0.1%
3.7%

Identifying genes associated with human diseases is essential for effective diagnosis and treatment. Experimentally identifying disease-causing genes is time-consuming and expensive. Computational prioritization methods aim to streamline this process by ranking genes based on their likelihood of association with a given disease. However, existing methods often report long ranked lists consisting of thousands of potential disease genes, many of which are false positives. This fails to meet the practical needs of clinicians, who require shorter, more precise candidate lists. To address this problem, we introduce DisGeneFormer (DGF), an end-to-end disease-gene prioritization pipeline. Our approach is based on two distinct graph representations, modeling gene and disease relationships, respectively. Each graph is first processed separately by graph attention and then jointly by a transformer module to combine within-graph and cross-graph knowledge through local and global attention. We propose an evaluation pipeline based on the precision of a top K ranked gene list, with K set to clinically feasible values between 5 and 50, relying solely on experimentally verified associations as ground truth. Our evaluation demonstrates that DGF substantially outperforms existing methods. We additionally assess the influence of the negative data sampling strategy and analyze the effect of graph topology and features on the performance of our model.
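The precision of a top K ranked gene list mentioned above can be written down in a few lines. The sketch below is a generic precision-at-K computation, not the DisGeneFormer evaluation code; the function name `precision_at_k` and the gene identifiers are hypothetical placeholders.

```python
from typing import Sequence, Set


def precision_at_k(ranked_genes: Sequence[str], true_genes: Set[str], k: int) -> float:
    """Fraction of the top-k ranked genes that are verified disease genes.

    The denominator is k itself, so a ranked list shorter than k is
    penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_genes[:k]
    hits = sum(1 for gene in top_k if gene in true_genes)
    return hits / k


# Hypothetical ranking for one disease and its verified associations.
ranked = ["G1", "G7", "G3", "G9", "G2"]
verified = {"G1", "G3", "G4"}
p_at_5 = precision_at_k(ranked, verified, 5)  # 2 hits among the top 5
```

Restricting the ground truth to experimentally verified associations, as the abstract describes, only changes what goes into `verified`; the computation itself is unchanged.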

6
VarDCL: A Multimodal PLM-Enhanced Framework for Missense Variant Effect Prediction via Self-distilled Contrastive Learning

Zhang, H.; Zheng, G.; Xu, Z.; Zhao, H.; Cai, S.; Huang, Y.; Zhou, Z.; Wei, Y.

2026-03-17 bioinformatics 10.64898/2026.03.13.711612 medRxiv
Top 0.1%
2.9%

Missense variants are a common type of genetic mutation that can alter the structure and function of proteins, thereby affecting the normal physiological processes of organisms. Accurately distinguishing damaging missense variants from benign ones is of great significance for clinical genetic diagnosis, treatment strategy development, and protein engineering. Here, we propose the VarDCL method, which integrates multimodal protein language model embeddings and self-distilled contrastive learning to identify subtle sequence and structural differences before and after protein mutations, thereby accurately predicting pathogenic missense variants. First, leveraging sequence and structural information before and after mutations, VarDCL generates sequence-structural multimodal features via different language models. It incorporates both global and local perspectives of feature embeddings to provide the model with dynamic, multimodal, and multi-view input data. Additionally, a Self-distilled Contrastive Learning (SDCL) module is proposed to enable more effective information integration and feature learning, enhancing the model's ability to detect sequence and structural changes induced by mutations. Within this module, the multi-level contrastive learning framework captures information differences before and after mutations within the same modality; meanwhile, the feature self-distillation mechanism uses high-level fused features to guide the learning of low-level differential features, facilitating information interaction across different modalities. The VarDCL framework not only ensures the model's capacity to learn dynamic changes pre- and post-mutation but also significantly improves cross-modal information interaction between sequence and structure, thereby boosting the model's performance in distinguishing pathogenic mutations from benign ones.
To validate the effectiveness of VarDCL, extensive experiments were conducted. The ablation study demonstrates that all key components of VarDCL contribute significantly. On an independent test set containing 18,731 clinical variants, VarDCL achieved an AUC of 0.917, an AUPR of 0.876, an MCC of 0.690, and an F1-score of 0.789, outperforming 21 state-of-the-art existing methods. Benchmark analysis shows that VarDCL can be utilized as an accurate and potent tool for predicting missense variant effects.
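Two of the metrics reported above, the F1-score and the Matthews correlation coefficient (MCC), are simple functions of the binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `f1_and_mcc` and the example counts are illustrative and are not taken from the VarDCL evaluation.

```python
import math
from typing import Tuple


def f1_and_mcc(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1, MCC) from binary confusion-matrix counts.

    Degenerate cases (an empty row or column) return 0.0 for the
    affected metric, which is a common convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Illustrative counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=45, fn=5)
```

Unlike F1, MCC uses all four cells of the confusion matrix, which is why the two can rank classifiers differently on imbalanced variant sets.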

7
HiCInterpolate: 4D Spatiotemporal Interpolation of Hi-C Data for Genome Architecture Analysis.

Chowdhury, H. M. A. M.; Oluwadare, O.

2026-02-09 bioinformatics 10.64898/2026.02.06.704438 medRxiv
Top 0.1%
2.6%

Motivation: Studying the three-dimensional (3D) structure of a genome, including chromatin loops and Topologically Associating Domains (TADs), is essential for understanding how the genome is organized, such as gene activation, cell development, protein-protein interaction, etc. Hi-C protocol enables us to study 3D genome structure and organization. Chromatin 3D structure changes dynamically over time, and modeling these continuous changes is crucial for downstream analysis in various domains such as disease diagnosis, vaccine development, etc. The high expense and impracticality of continuous genome sequencing, particularly what evolves between two timestamps, limit the most effective genomic analysis. It is crucial to develop a straightforward and cost-efficient method for constantly generating genomic data between two timestamps in order to address these constraints. Results: In this study, we developed HiCInterpolate, a 4D spatiotemporal interpolation architecture that accepts two timestamp Hi-C contact matrices to interpolate intermediate Hi-C contact matrices at high resolution. HiCInterpolate predicts the intermediate Hi-C contact map using a deep learning-based flow predictor, and a feature encoder and decoder architecture similar to U-Net. In addition, HiCInterpolate supports downstream analysis of multiple 3D genomic features, including A/B compartments, chromatin loops, TADs, and 3D genome structure, through an integrated analysis pipeline. Across multiple evaluation metrics, including PSNR, SSIM, GenomeDISCO, HiCRep, and LPIPS, HiCInterpolate achieved consistently strong performance. Biological validation further demonstrated preservation of key chromatin organization features, such as chromatin loops, A/B compartments, and TADs.
Together, these results indicate that HiCInterpolate provides a robust computer vision-based framework for high-resolution interpolation of intermediate Hi-C contact matrices and facilitates biologically meaningful downstream analyses. Availability: HiCInterpolate is publicly available at https://github.com/OluwadareLab/HiCInterpolate.

8
Solving the Diagnostic Odyssey with Synthetic Phenotype Data

Colangelo, G.; Marti, M.

2026-03-23 bioinformatics 10.64898/2026.03.19.712946 medRxiv
Top 0.1%
2.5%

The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.

9
Prioritizing DNA methylation biomarkers using graph neural networks and explainable AI

Kumar, A.; Do, T. A.; Gruening, B.; Becker, H.; Backofen, R.

2026-01-27 bioinformatics 10.64898/2026.01.26.701692 medRxiv
Top 0.1%
2.2%

DNA methylation is a significant epigenetic modification involving the addition of a methyl group at the 5 position of cytosine residues. The modification is responsible for disease progression, immune response, and outcomes in diseases such as breast cancer (BC) and acute myeloid leukemia (AML). Illumina's HumanMethylation450 BeadChip (450K) and EPIC BeadChip (850K) methylation arrays are heavily used for such cancer studies to determine differentially expressed and differentially methylated genomic regions. Many of these are biomarkers used effectively for exploring therapeutic targets. Several studies report a few potential biomarkers, but the enormous numbers of largely unexplored probe-level (CpG sites) methylation signals may contain additional significant biomarkers. To prioritise the under-explored and disease-specific CpG sites from DNA methylation arrays and potentially uncover novel biomarkers, we present GraphMeX-plain, a novel graph neural network (GNN)-based approach with an explainable AI module. The underlying graph neural network is a principal neighbourhood aggregation (PNA) network. The approach uses the biomarkers reported in recent studies to rank biomarkers from the unexplored set. A similarity graph between CpG sites (known and unexplored sets) is constructed using DNA methylation β values from arrays, producing an interaction graph. Biomarkers from recent studies are used as seeds, and from the unexplored CpG sites, highly variable ones (excluding the seeds) that vary significantly between conditions (BC patients and normal controls for breast cancer arrays) are selected.
Using the combination of seed and highly-variable CpG sites, a positive-unlabeled approach, network-informed adaptive positive-unlabeled learning (NIAPU), is utilized to assign a set of soft labels to unknown CpG sites such as likely positive, weakly negative, likely negative, and reliable negative in the descending order of likelihood of CpG sites being potential biomarkers. The graph neural network, a multi-layer PNA, refines the soft label assignments and achieves a high F1 classification score (weighted) of 0.93 for BC and 0.91 for AML. The most likely set of CpG sites, classified under "likely positive", are further explored using GNNExplainer, an explainable AI approach. Subgraphs for likely positive CpG sites, predicted with high probabilities, are computed and their proximities to the original seed CpG sites are analysed. The CpG sites which are predicted as likely positives have close interactions to the seeds. The top likely positive CpG site for BC is cg13265740 (C6orf115) where gene C6orf115 is strongly associated with BC. For AML, the top likely positive predicted CpG site is cg23281527 (KLHDC7A) where gene KLHDC7A plays a strong role in the mechanism of AML. A high percentage of these likely positive predicted CpG sites for both BC and AML, which remained unseen by the GNN model during training, are highly relevant to them and can serve as potential therapeutic targets and prognostic values.

10
Adaptive Integration of Heterogeneous Foundation Models to Find Histologically Predictable Genes in Breast Cancer

Nguyen, H.; Li, C.; Peng, C.; Simpson, P.; Ye, N.; Nguyen, Q.

2026-04-08 bioinformatics 10.64898/2026.04.05.716435 medRxiv
Top 0.1%
2.1%

Foundation models for computational pathology have rapidly emerged as powerful tools for extracting rich biological and morphological representations from histopathology images. However, variations in model architecture, pre-training data, and optimization objectives often lead to task-dependent performance, rather than universal generalization. As a result, effective strategies for integrating their complementary strengths are essential to fully realize the potential of foundation models for robust histopathology analysis. Meanwhile, recent breakthroughs such as spatial transcriptomics provide an unprecedented opportunity to integrate genetic and histopathology information from the same patient sample, thereby maximizing both molecular and anatomical pathology insights. Specifically, each model's embedding is first mapped to gene-level predictions via a dedicated prediction head, enabling model-specific feature utilization. A lightweight weighting network then adaptively aggregates these predictions to produce a unified and robust output at gene and spatial location levels. Across multiple spatial transcriptomics datasets, our approach consistently outperforms both individual foundation models and classical ensembling methods. Focusing on breast cancer, we observe substantial gains in prediction accuracy for clinically relevant PAM50 subtype markers and drug-target genes. Moreover, the proposed framework improves interpretability by revealing model-specific contributions and specialization at the gene level. Overall, our work presents an effective solution to integrating multiple foundation models for enhancing the genetic analyses of histopathology images.

11
Differential Network-Based Causal Graph Learning for Cardiovascular Recurrence Risk Prediction and Factor Discovery

Zhou, M.; Zhang, M.; Wang, J.; Shao, C.; Yan, G.

2026-03-18 cardiovascular medicine 10.64898/2026.03.16.26348547 medRxiv
Top 0.1%
2.1%

Cardiovascular disease is one of the leading causes of death worldwide, with myocardial infarction (MI) being a major cause of both morbidity and mortality among cardiovascular patients. MI patients face a higher risk of cardiovascular disease recurrence afterwards. Therefore, accurately predicting the risk of recurrence and identifying key risk factors are crucial for clinical decision-making. In this paper, we consider the interrelationships among cardiovascular factors from a systemic perspective. We first construct a differential network for each patient to capture individual-specific deviations in factor relationships and propose a novel method, termed Causal Factor-aware Graph Neural Network (CFGNN), which integrates factor interactions to predict the recurrence risk of MI patients while uncovering key risk factors from a causal perspective. Experimental results demonstrate that CFGNN performs well on real-world hospital-derived datasets, effectively identifying several key risk factors. This method not only deepens our understanding of cardiovascular disease, but also paves the way for more targeted and effective interventions.

12
Federated penalized piecewise exponential model for horizontally distributed survival data: FedPPEM

Islam, N.; Luo, C.; Tong, J.; Polleya, D. A.; Jordan, C. T.; Haverkos, B.; Bair, S.; Kent, A.; Weller, G.

2026-02-12 health informatics 10.64898/2026.02.11.26346054 medRxiv
Top 0.2%
1.9%

Cox proportional hazards regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data further complicate estimation. In such cases, data augmentation through multi-site collaboration can alleviate these problems. However, this often necessitates sharing individual patient data (IPD) across sites, which presents challenges due to regulatory barriers aimed at protecting patient privacy. To overcome these challenges, we propose a privacy-preserving algorithm that eliminates sharing IPD across sites and fits a federated penalized piecewise exponential model (FedPPEM) to estimate potential effects of clinical features using summary statistics. This algorithm yields results nearly identical to those from pooled IPD, including effect size and standard error estimates. We demonstrate the model's performance in quantifying effects of clinical features and genetic risk classification on overall survival using real-world data from ~1,200 newly diagnosed AML patients across 33 U.S. sites. Although applied in the AML context, this model is disease-agnostic and can be implemented in other diseases and clinical contexts.
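Why a piecewise exponential model lends itself to fitting from summary statistics can be seen in the simplest case. Without covariates or a penalty, the maximum-likelihood hazard in each interval is the number of events divided by the person-time at risk, and both quantities add up across sites. The sketch below illustrates only that special case under those assumptions; it is not the FedPPEM algorithm, which additionally handles clinical covariates and penalization. The names `site_summaries` and `pooled_hazards` are hypothetical.

```python
from typing import List, Sequence, Tuple

Summary = Tuple[List[int], List[float]]  # (events per interval, person-time per interval)


def site_summaries(times: Sequence[float], events: Sequence[int],
                   cuts: Sequence[float]) -> Summary:
    """Per-interval event counts and person-time for one site.

    `cuts` are the interior interval boundaries; intervals are
    (0, c1], (c1, c2], ..., (c_last, inf). Only these aggregates
    would leave the site, never the individual records.
    """
    bounds = [0.0] + list(cuts) + [float("inf")]
    n_int = len(bounds) - 1
    d = [0] * n_int
    e = [0.0] * n_int
    for t, ev in zip(times, events):
        for j in range(n_int):
            lo, hi = bounds[j], bounds[j + 1]
            if t <= lo:
                break  # follow-up ended before this interval started
            e[j] += min(t, hi) - lo
            if ev and t <= hi:
                d[j] += 1  # event falls inside this interval
    return d, e


def pooled_hazards(summaries: Sequence[Summary]) -> List[float]:
    """Covariate-free MLE of each interval hazard from summed site summaries."""
    n_int = len(summaries[0][0])
    d_tot = [sum(s[0][j] for s in summaries) for j in range(n_int)]
    e_tot = [sum(s[1][j] for s in summaries) for j in range(n_int)]
    return [d / e if e > 0 else 0.0 for d, e in zip(d_tot, e_tot)]


site_a = site_summaries([1.0, 3.0], [1, 0], cuts=[2.0])
site_b = site_summaries([2.5], [1], cuts=[2.0])
hazards = pooled_hazards([site_a, site_b])
```

In this toy case the federated estimate is exactly the one obtained by pooling the three patients into a single dataset, which is the kind of equivalence the abstract reports for the full model.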

13
Identifying genes associated with phenotypes using machine and deep learning

Muneeb, M.; Ascher, D.

2026-03-07 bioinformatics 10.64898/2026.03.05.709665 medRxiv
Top 0.2%
1.8%

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.

14
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.2%
1.8%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, and compared them primarily with regard to variable selection capability and secondarily with regard to survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best-performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
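For readers unfamiliar with the Benjamini-Hochberg procedure named above, the sketch below implements the standard step-up rule: sort the m p-values, find the largest rank i with p(i) <= alpha * i / m, and reject every hypothesis up to that rank. It is a plain-Python illustration with made-up p-values, not the benchmark's own code; the function name `benjamini_hochberg` is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank that meets its threshold.
        if pvalues[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values. The smallest one misses its own threshold
# (0.02 > 0.0125) but is still rejected because rank 3 passes.
flags = benjamini_hochberg([0.02, 0.02, 0.03, 0.5], alpha=0.05)
```

The example shows the step-up behaviour that distinguishes the procedure from testing each p-value against its own threshold independently.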

15
TPCAV: Interpreting deep learning genomics models via concept attribution

Yang, J.; Mahony, S.

2026-01-21 bioinformatics 10.64898/2026.01.20.700723 medRxiv
Top 0.2%
1.7%

Interpreting genomics deep learning models remains challenging. Existing feature attribution methods largely focus on scoring individual bases or extracting global DNA motifs from one-hot encoded inputs, leaving them unable to assess broader genomic features such as chromatin accessibility or sequence annotations. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We introduce Testing with PCA-projected Concept Activation Vectors (TPCAV), which improves upon the original method by using a PCA-based decorrelation transformation to address the correlated and redundant embedding features common in genomics models. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides more reliable DNA motif interpretation than TCAV and is comparable to TF-MoDISco on one-hot coded DNA-based transcription factor binding prediction models. Beyond motif interpretation, TPCAV enables robust interpretive analysis of more general concepts such as repetitive elements and chromatin accessibility and generalizes to tokenized foundation models as well as models incorporating chromatin signal inputs. We further show that TPCAV can identify representative transcription factor binding sites associated with specific concepts, motivating downstream investigation of distinct binding mechanisms. Overall, TPCAV provides a flexible and robust complement to existing model interpretation techniques.

16
CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology

Liu-Swetz, Y.; Tan, S.; Qin, W.; Wang, H.

2026-02-06 bioinformatics 10.64898/2026.02.04.703870 medRxiv
Top 0.2%
1.7%

Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.

17
gSV: a general structural variant detector using the third-generation sequencing data

Hao, J.; Shi, J.; Lian, S.; Zhang, Z.; Luo, Y.; Hu, T.; Ishibashi, T.; Wang, D.; Wang, S.; Fan, X.; Yu, W.

2026-03-04 bioinformatics 10.64898/2026.03.02.703663 medRxiv
Top 0.2%
1.7%

Structural variants (SVs) are major contributors to genome diversity and disease susceptibility, particularly in cancer. Although third-generation sequencing technologies have substantially improved SV detection sensitivity, accurate detection of complex SVs remains challenging due to fragmented and heterogeneous alignment signals, as well as the dependence of many existing methods on predefined variant models. In this paper, we propose gSV, a general SV detector that integrates alignment-based and assembly-based approaches with the maximum exact match (MEM) strategy, with particular emphasis on resolving SVs with complex or atypical alignment signatures. Without predefined assumptions about SV types, gSV captures diverse variant signals, enabling the detection of SVs that are usually missed by conventional tools. Benchmarking using both simulated datasets and real long-read sequencing data demonstrates that gSV achieves improved sensitivity and overall detection performance compared with current state-of-the-art SV callers, particularly for simple and complex SV events with complex alignment patterns. Unique SV discoveries in four breast cancer cell lines, particularly in cancer-associated genes, demonstrate the potential biological relevance of gSV-enabled discoveries. Furthermore, analysis of a breast cancer cohort from the Chinese population highlights the utility of gSV for population-scale genomic studies. Collectively, gSV provides a unified framework for comprehensive SV discovery in both research and clinical genomics settings.

Key Points
- Existing structural variant (SV) detection tools are limited in resolving SVs with complex alignment patterns due to their reliance on predefined variant models.
- gSV integrates alignment-based and assembly-based evidence using a maximum exact match (MEM) strategy, enabling capture of diverse and complex SV signals.
- Benchmarking on simulated and real long-read sequencing datasets demonstrates that gSV achieves competitive performance on canonical SV classes and improved sensitivity for complex SV patterns.
- Application of gSV to breast cancer cell lines and a population-scale breast cancer cohort reveals previously unresolved SVs in cancer-associated genes, highlighting its utility in genomic and clinical studies.

18
Beyond Lipschitz: Ranking Binding Affinity in Hyperbolic Space

Wu, K.; Hong, X.; Zhu, W.; Gao, B.; Ma, W.-Y.; Lan, Y.

2026-02-03 bioinformatics 10.64898/2026.02.01.703164 medRxiv
Top 0.2%
1.7%

Despite the importance of protein-ligand affinity ranking in drug discovery, existing deep learning models struggle to distinguish hard inactives that are structurally similar to active compounds but biologically inactive. We theoretically show this failure stems from the Lipschitz continuity constraint in Euclidean space, which makes neural networks locally insensitive to subtle yet critical structural perturbations. To overcome this, we propose AlphaRank, which demonstrates that scoring affinity as the negative hyperbolic geodesic distance allows subtle tangential variations to bypass Lipschitz constraints, intuitively corresponding to pivotal binding mode factors such as conformational fit. Meanwhile, AlphaRank employs joint optimization to simultaneously rank affinity scores among actives in a proximity-aware manner and separate actives from decoy inactives through a geometric cone constraint. Experiments demonstrate that AlphaRank outperforms state-of-the-art models like Boltz-2 in both affinity ranking and active-inactive discrimination, while providing an interpretable representation space.
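Scoring affinity as a negative hyperbolic geodesic distance can be made concrete with a small sketch. The abstract does not say which model of hyperbolic space AlphaRank uses, so the Poincaré ball model below is an assumption, and the function names `poincare_distance` and `affinity_score` are hypothetical. The distance formula itself is the standard one for the unit Poincaré ball.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def affinity_score(protein_emb, ligand_emb):
    """Assumed scoring rule: higher score means smaller hyperbolic distance."""
    return -poincare_distance(protein_emb, ligand_emb)
```

Because the denominator shrinks as points approach the boundary of the ball, the same small Euclidean displacement produces a much larger hyperbolic distance near the boundary than near the origin. This is one way to picture how a small tangential variation can change the score sharply.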

19
Information Leakage in Enzyme Substrate Prediction

Atabaigi Elmi, V.; Joeres, R.; Kalinina, O. V.

2026-03-01 bioinformatics 10.64898/2026.02.26.708291 medRxiv
Top 0.2%
1.7%

Enzymes are essential catalysts in many cellular processes. Understanding their interactions with small molecules, such as regulators, cofactors, and most importantly, substrates, is crucial for understanding the biochemical processes that occur in cells. Correctly interpreting the roles of small molecules that interact with enzymes is key to elucidating enzyme function. Recently, the field of enzyme-small molecule interaction prediction has seen growing interest in computational and, especially, deep-learning methods, and numerous datasets and models with remarkable performance have been published. In this work, we critically examine one of the most popular datasets and three models trained on it, identifying leaked information that may inflate reported model performance. We show that the inspected models are susceptible to information leakage, and their performance drops to near-random when the leakage is removed.

20
GCN-Mamba: Graph Convolutional Network with Mamba for Antibacterial Synergy Prediction

Su, H.; Liang, Y.; Xiao, W.; Li, H.; Liu, X.; Yang, Z.; Yuan, M.; Liu, X.

2026-03-12 bioinformatics 10.64898/2026.03.10.710738 medRxiv
Top 0.2%
1.7%

The escalating crisis of antimicrobial resistance necessitates novel therapeutic strategies, among which drug combination therapy shows great promise by enhancing efficacy and reducing toxicity. However, identifying effective synergistic pairs from the vast combinatorial space remains experimentally challenging and resource-intensive. To address this, we introduce GCN-Mamba, a deep learning framework that integrates Graph Convolutional Networks (GCN) with the Mamba State Space Model. This architecture captures both local molecular topological structures and global implicit interactions by leveraging Extended 3-Dimensional Fingerprints (E3FP) and bacterial gene expression profiles. Evaluation on a comprehensive dataset demonstrated that GCN-Mamba significantly outperforms classical machine learning models in predictive accuracy. In a targeted case study against Methicillin-resistant Staphylococcus aureus (MRSA), the model successfully rediscovered known synergistic pairs, such as Quercetin and Curcumin, consistent with recent literature. Furthermore, prospective in vitro validation confirmed a novel synergistic combination of Shikimic acid and Oxacillin, validating the model's practical utility. By efficiently prioritizing potential candidates, GCN-Mamba serves as a powerful and reliable tool for accelerating the discovery of synergistic antimicrobial combinations, effectively bridging the gap between computational prediction and experimental validation.